## [1] 1599 12
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The mean and median of red wine quality are 5.636 and 6, respectively. The distribution graph shows that few red wines are of very low or very high quality, and the distribution appears roughly Gaussian.
To simplify our visualization, I created a categorical variable quality_level with levels [low, medium, high] to group the quality scores.
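A grouping like this can be sketched with cut(). The cut points used here (scores up to 4 as low, 5 to 6 as medium, 7 and above as high) are my own assumption for illustration, since the thresholds are not stated in this report. The input is just the first ten quality scores printed in the str() output above.

```r
# First ten quality scores from the str() output, used as a toy sample.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# ASSUMPTION: these break points are illustrative only; the report does not
# state the thresholds that were actually used.
quality_level <- cut(quality,
                     breaks = c(-Inf, 4, 6, Inf),  # (-Inf,4], (4,6], (6,Inf)
                     labels = c("low", "medium", "high"),
                     ordered_result = TRUE)

# Count how many wines fall into each group.
table(quality_level)
```

Making the factor ordered keeps low < medium < high in plots and tables instead of alphabetical order.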
residual.sugar and chlorides have more outliers than the other variables. citric.acid has 132 observations equal to 0, and its distribution has another peak at 0.49. density and pH are approximately normally distributed, and both span a very small range. residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol are right-skewed.
Applying a log transformation to the right-skewed variables produces distributions that are closer to normal.
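The effect of the transformation can be checked numerically with a small sketch. It uses only the first ten residual.sugar values from the str() output, a hand-written moment-based skewness helper (skew is a hypothetical name, not a base R function), and log10, which is an assumption because the base of the logarithm is not stated in this report.

```r
# First ten residual.sugar values from the str() output, used as a toy sample.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical helper: moment-based sample skewness (third standardized moment).
skew <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# ASSUMPTION: base-10 log; any log base reduces right skew in the same way.
log_sugar <- log10(residual_sugar)

# Compare skewness before and after the transformation.
skews <- c(raw = skew(residual_sugar), logged = skew(log_sugar))
skews
```

A positive skewness indicates a long right tail, so a smaller value after logging means the transformed variable is closer to symmetric.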
There are 1599 observations in the dataset with 12 original variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality) plus the derived variable quality_level. The 11 physicochemical variables are numeric and quality is an integer. Below are some observations:
The main feature in this dataset is the quality of red wine. I will further examine the relationship of each variable with quality and select suitable variables to build a predictive model.
Some variables might provide similar information, for example the three sulfur-related variables (sulphates, free.sulfur.dioxide and total.sulfur.dioxide) and the three kinds of acids (fixed.acidity, volatile.acidity and citric.acid). I assume that there are three general groups of features of interest: acid (pH), alcohol and sulphates.
I created a new variable quality_level to simplify visualization.
There are six variables (residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol) that are right-skewed. I performed a log transformation on them to make them more normally distributed. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed variables, but reducing the skew limits the influence of extreme values, so using the transformed variables should be more robust when building the linear regression model later.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
From the correlation table, we can see that most variables have very small correlation coefficients with quality. alcohol has the highest correlation with quality (0.476), followed by volatile.acidity (-0.391) and sulphates (0.251). I will further analyze these three variables against red wine quality.
First we take a look at the relationship between [alcohol, volatile.acidity, sulphates] and quality. Since quality is a discrete variable, it is more straightforward to look at the boxplots. Note that alcohol and sulphates are log-transformed in the plots, while the numeric summaries printed below are on the original scale.
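A ranking like this can be read off the correlation matrix programmatically. The sketch below uses cor() on a toy data frame built from the first ten rows shown in the str() output (red_head is a hypothetical name), so its coefficients will not match the full-data table above.

```r
# First ten rows of four columns, copied from the str() output (toy sample).
red_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation matrix (the default method of cor()).
cor_matrix <- cor(red_head)

# Keep only the correlations of each feature with quality.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]

# Rank features by the absolute size of their correlation with quality.
sort(abs(quality_cor), decreasing = TRUE)
```

Sorting by absolute value matters here because a strong negative correlation, such as the one for volatile.acidity in the full data, is just as informative as a positive one.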
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
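We can see from the medians and quartiles that alcohol content generally increases with the quality score: the median rises from 9.925 at quality 3 to 12.15 at quality 8. The trend is not strictly monotonic, since quality 5 has the lowest median (9.7).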
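The per-group summaries above have the layout that by() prints when summary is applied within each quality score. A minimal sketch on the first ten rows from the str() output follows; with so few rows only scores 5, 6 and 7 appear, so the numbers differ from the full-data output.

```r
# First ten alcohol and quality values from the str() output (toy sample).
red <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Six-number summary of alcohol within each quality score.
by(red$alcohol, red$quality, summary)

# Just the medians, as a named vector that is easy to compare across groups.
alcohol_medians <- tapply(red$alcohol, red$quality, median)
alcohol_medians
```

tapply() is handy when only one statistic per group is needed, for example to check whether the medians increase with quality.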
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
volatile.acidity has a negative correlation with quality: the higher the volatile.acidity, the lower the quality. The median falls from 0.845 at quality 3 to 0.37 at qualities 7 and 8.
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
sulphates has a positive correlation with quality: the median rises from 0.545 at quality 3 to 0.74 at qualities 7 and 8.
In addition, the correlation matrix shows that fixed.acidity is highly correlated with citric.acid (0.67), pH (-0.68) and density (0.67). free.sulfur.dioxide and total.sulfur.dioxide are also highly correlated (0.67), as I suggested earlier.
Notice that density is described as follows: “the density of wine is close to that of water depending on the percent alcohol and sugar content”. It should therefore be strongly correlated with residual.sugar and alcohol. It does have a fairly strong correlation with alcohol (-0.50) and a moderate one with residual.sugar (0.36), but its strongest correlation is with fixed.acidity (0.67).
Also, I assumed earlier that the variables in the sulphates group would be strongly correlated with each other. In fact, sulphates has very little correlation with free.sulfur.dioxide (0.05) and total.sulfur.dioxide (0.04).
Many features have very low correlation with quality, especially residual.sugar, free.sulfur.dioxide and pH, whose coefficients are near zero.
alcohol has the strongest correlation with quality. The other two features with correlation coefficients larger than 0.25 in absolute value are volatile.acidity and sulphates.
sulphates has low correlation with free.sulfur.dioxide and total.sulfur.dioxide.
density has a stronger correlation with fixed.acidity than with residual.sugar or alcohol.
free.sulfur.dioxide has strong correlation with total.sulfur.dioxide.
The quality of red wine is positively correlated with the percentage of alcohol and negatively correlated with volatile acidity.
To further examine the two variables with the highest correlation with quality, I created the graph below:
High quality red wines tend to have higher alcohol values and lower volatile.acidity. The same graphs for the other two pairs of variables:
Given that even the largest correlation coefficient is still quite low, it is no surprise that the R-squared of our linear regression model is low. Even with all variables added, the R-squared is only 0.368.
Better quality red wines have higher alcohol and sulphates values and lower volatile acidity.
Overall, none of the features has a strong correlation with quality.
quality is positively correlated with sulphates but negatively correlated with total.sulfur.dioxide. SO2 acts as an antimicrobial and antioxidant, but at high levels it affects the smell and taste of red wine. We could intuitively conclude that the higher the SO2, the lower the quality. sulphates, however, is different from the sulfur dioxide level: it is a wine additive.
I created a linear regression model to predict red wine quality. The model has a low R^2 in general. Our three most correlated features explain 34.6% of the total variance. With all features added to the model, they explain 36.8% of the total variance in red wine quality. Since the correlation coefficients are all quite low, the data do not fit well with the linear model's assumption of a linear relationship between the predictors and the response. It is not advisable to rely on a linear regression model for this dataset.
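The comparison between a smaller and a larger model can be sketched with lm() and the r.squared field of its summary. This is only an illustration on the first ten rows from the str() output, with untransformed predictors and hypothetical object names (red_head, fit_small, fit_full); it does not reproduce the 34.6% and 36.8% figures, which come from the full dataset.

```r
# First ten rows of four columns, copied from the str() output (toy sample).
red_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# ASSUMPTION: raw (untransformed) predictors; the report's exact model
# specification is not shown, so these formulas are illustrative.
fit_small <- lm(quality ~ alcohol, data = red_head)
fit_full  <- lm(quality ~ alcohol + volatile.acidity + sulphates,
                data = red_head)

# Share of variance in quality explained by each model.
r2_small <- summary(fit_small)$r.squared
r2_full  <- summary(fit_full)$r.squared
c(alcohol_only = r2_small, three_features = r2_full)
```

Adding predictors to a nested least-squares model can never lower R^2, so a small gain from extra features, as seen in the full data, suggests they carry little additional linear information about quality.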
The distribution of red wine quality is nearly normal. Most red wines are of medium quality, probably because good-quality red wines are hard to produce and most customers can only afford medium-quality red wines.
Red wine quality is positively correlated with alcohol percentage and negatively correlated with volatile acidity. Wines with the highest quality score (8) have a median alcohol content of 12.15% and a median volatile acidity of 0.37.
Higher quality red wine has a higher level of alcohol and a lower level of volatile acidity. As the graph shows, the linear relationship is not strong.
The red wine data set has 1599 observations, which were collected in 2009. The wines are red variants of the Portuguese “Vinho Verde” wine. The quality of each wine is scored by experts on a scale from 0 to 10.
I started with univariate analysis. Almost half of the variables in the dataset are right-skewed, so I performed a log transformation on them. Since there are some similarities among the variables, I supposed that the quality of red wine is mainly affected by three groups of indicators (sulfur dioxide, alcohol and acidity). I then continued with the bivariate and multivariate analysis. It turned out that my supposition was partially correct: the three variables most correlated with quality are alcohol, volatile acidity and sulphates. There are some interesting correlations among the features. Nevertheless, all features have low correlation with red wine quality.
Given the low correlations, it is no surprise that the linear regression model does not perform well. All variables in the dataset together explain only 36.8% of the total variance in red wine quality. The limitation of our model is obvious, and it is best not to use a linear model in this case.
I think the issues mentioned above might come from the original dataset.
The dataset covers only one source of red wine. There are many other kinds of red wine, and including only one kind will bias our model when we want to predict the quality of other kinds.
There are not enough observations and variables. We can see from the correlation table that several variables are correlated with each other, so it would be better to have more independent indicators for the red wine. The numbers of both low and high quality red wines are quite small, and an analysis based on so few observations might not be accurate.